Cross-lingual transfer learning and multitask learning for capturing multiword expressions
This is an accepted manuscript of an article published by Association for Computational Linguistics in Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019), available online: https://www.aclweb.org/anthology/W19-5119
Recent developments in deep learning have prompted a surge of interest in the application of multitask and transfer learning to NLP problems. In this study, we explore, for the first time, the application of transfer learning (TRL) and multitask learning (MTL) to the identification of multiword expressions (MWEs). For MTL, we exploit the syntactic information shared between MWE identification and dependency parsing to jointly train a single model on both tasks. Specifically, we predict two types of labels: MWE tags and dependency parse labels. Our neural MTL architecture uses the supervision of dependency parsing in its lower layers and predicts MWE tags in its upper layers. In the TRL scenario, we overcome the scarcity of data by training a model on a larger MWE dataset and transferring the knowledge to a resource-poor setting in another language. In both scenarios, the resulting models achieve higher performance than standard neural approaches.
Automatic identification and translation of multiword expressions
A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy.
Multiword Expressions (MWEs) belong to a class of phraseological phenomena that is ubiquitous in the study of language. They are heterogeneous lexical items consisting of more than one word and feature lexical, syntactic, semantic and pragmatic idiosyncrasies. Scholarly research on MWEs benefits both natural language processing (NLP) applications and end users. This thesis involves designing new methodologies to identify and translate MWEs. To deal with MWE identification, we first develop datasets of annotated verb-noun MWEs in context. We then propose a method which employs word embeddings to disambiguate between literal and idiomatic usages of the verb-noun expressions. The existence of expression types with varying distributions of idiomatic and literal usages leads us to re-examine their modelling and evaluation. We propose a type-aware train and test splitting approach to prevent models from overfitting and to avoid misleading evaluation results. Identification of MWEs in context can be modelled with sequence tagging methodologies. To this end, we devise a new neural network architecture which combines convolutional neural networks and long short-term memory networks, with an optional conditional random field layer on top. We conduct extensive evaluations on several languages, demonstrating better performance than the state-of-the-art systems. Experiments show that the generalisation power of the model in predicting unseen MWEs is significantly better than that of previous systems. To find translations for verb-noun MWEs, we propose a bilingual distributional similarity approach derived from a word embedding model that supports arbitrary contexts. The technique is devised to extract translation equivalents from comparable corpora, which are an alternative resource to costly parallel corpora. Finally, we conduct a series of experiments to investigate the effects of the size and quality of comparable corpora on the automatic extraction of translation equivalents.
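The type-aware train and test splitting mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not spell out the procedure, so this is only one possible reading: whole expression types are held out, so that no type occurs in both sets. The function name `type_aware_split`, the tuple layout, and the toy sentences are all hypothetical.

```python
import random
from collections import defaultdict


def type_aware_split(instances, test_fraction=0.2, seed=0):
    """Split (expression_type, sentence, label) tuples so that no
    expression type occurs in both the training and the test set.

    Assumption: "type-aware" is read here as holding out whole types.
    """
    # Group all instances by their expression type.
    by_type = defaultdict(list)
    for instance in instances:
        by_type[instance[0]].append(instance)

    # Shuffle the types (not the instances) reproducibly.
    types = sorted(by_type)
    random.Random(seed).shuffle(types)

    # Hold out at least one whole type for testing.
    n_test = max(1, round(len(types) * test_fraction))
    test_types = set(types[:n_test])

    train = [x for t in types if t not in test_types for x in by_type[t]]
    test = [x for t in types if t in test_types for x in by_type[t]]
    return train, test


# Toy data, invented for illustration only.
data = [
    ("take place", "The meeting will take place on Monday.", "idiomatic"),
    ("take place", "Please take your place at the table.", "literal"),
    ("make sense", "That does not make sense.", "idiomatic"),
    ("give way", "The bridge began to give way.", "idiomatic"),
    ("hit road", "We should hit the road early.", "idiomatic"),
    ("hit road", "The falling branch hit the road.", "literal"),
]
train, test = type_aware_split(data, test_fraction=0.25)
```

With an instance-level random split, a frequent type could appear in both sets and inflate the scores; splitting on types is one way to measure how well a model handles expressions it has not seen.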
GCN-Sem at SemEval-2019 Task 1: Semantic Parsing using Graph Convolutional and Recurrent Neural Networks
This paper describes the system submitted to the SemEval 2019 shared task 1 "Cross-lingual Semantic Parsing with UCCA". We rely on the semantic dependency parse trees provided in the shared task, which are converted from the original UCCA files, and model the task as tagging. The aim is to predict the graph structure of the output along with the types of relations among the nodes. Our proposed neural architecture is composed of Graph Convolution and BiLSTM components. The layers of the system share their weights while predicting dependency links and semantic labels. The system is applied to the CoNLL-U format of the input data and is best suited for semantic dependency parsing.
WLV at SemEval-2018 task 3: Dissecting tweets in search of irony
International Workshop on Semantic Evaluation.
This paper describes the systems submitted to SemEval 2018 Task 3 "Irony detection in English tweets" for both subtasks A and B. The first system, which leverages a combination of sentiment, distributional semantic, and text surface features, ranked third among 44 teams according to the official leaderboard of subtask A. The second system, with a slightly different representation of the features, ranked ninth in subtask B. We present a method that entails decomposing tweets into separate parts. Searching for contrast within the constituents of a tweet is an integral part of our system. We embrace an extensive definition of contrast, which leads to broad coverage in detecting ironic content.
Research Group in Computational Linguistics
Bilingual contexts from comparable corpora to mine for translations of collocations
Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2016
Due to the limited availability of parallel data in many languages, we propose a methodology that benefits from comparable corpora to find translation equivalents for collocations (as a specific type of difficult-to-translate multiword expression). Finding translations is known to be more difficult for collocations than for words. We propose a method based on bilingual context extraction and build a word (distributional) representation model drawing on these bilingual contexts (bilingual English-Spanish contexts in our case). We show that the bilingual context construction is effective for the task of translation equivalent learning and that our method outperforms a simplified distributional similarity baseline in finding translation equivalents.
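To make the idea of finding translation equivalents by distributional similarity concrete, here is a minimal sketch. It assumes that source and target expressions have already been mapped to vectors in a shared space (which the bilingual contexts in the abstract are meant to enable); the way those vectors are built is not shown. The function names and the three-dimensional toy vectors are invented for illustration and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_translations(source_vec, candidates):
    """Return candidate names sorted by decreasing similarity to the source."""
    return sorted(
        candidates,
        key=lambda name: cosine(source_vec, candidates[name]),
        reverse=True,
    )


# Hypothetical vector for the English collocation "make a decision".
source = [0.9, 0.1, 0.3]

# Hypothetical vectors for Spanish candidates in the same space.
candidates = {
    "tomar una decisión": [0.8, 0.2, 0.3],
    "hacer una pregunta": [0.1, 0.9, 0.2],
    "dar un paseo": [0.2, 0.1, 0.9],
}

ranking = rank_translations(source, candidates)
best = ranking[0]
```

In this toy setup the top-ranked candidate is the one whose vector points in nearly the same direction as the source vector; a real system would rank a much larger candidate list extracted from the comparable corpora.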
Using gaze data to predict multiword expressions
In recent years gaze data has been increasingly used to improve and evaluate NLP models because it carries information about the cognitive processing of linguistic phenomena. In this paper we conduct a preliminary study towards the automatic identification of multiword expressions based on gaze features from native and non-native speakers of English. We compare a part-of-speech (POS) and frequency baseline with: i) a prediction model based solely on gaze data and ii) a combined model of gaze data, POS and frequency. In spite of the challenging nature of the task, the best performance was achieved by the latter. Furthermore, we explore how the type of gaze data (from native versus non-native speakers) affects the prediction, showing that data from the two groups is discriminative to an equal degree. Finally, we show that late processing measures are more predictive than early ones, which is in line with previous research on idioms and other formulaic structures.
Cognitive processing of multiword expressions in native and non-native speakers of English: evidence from gaze data
Gaze data has been used to investigate the cognitive processing of certain types of formulaic language such as idioms and binominal phrases; however, very little is known about the online cognitive processing of multiword expressions. In this paper we use gaze features to compare the processing of verb-particle and verb-noun multiword expressions to control phrases of the same part-of-speech pattern. We also compare the gaze data for certain components of these expressions and the control phrases in order to find out whether these components are processed differently from the whole units. We provide results for both native and non-native speakers of English and we analyse the importance of the various gaze features for the purpose of this study. We discuss our findings in light of the E-Z model of reading.
Language resources for Italian: Towards the development of a corpus of annotated Italian multiword expressions
Napoli, Italy, December 5-7, 2016
This paper describes the first resource annotated for multiword expressions (MWEs) in Italian. Two versions of this dataset have been prepared: the first with a fast markup list of out-of-context MWEs, and the second with an in-context annotation, where the MWEs are entered with their contexts. The paper also discusses annotation issues and reports the inter-annotator agreement for both types of annotations. Finally, the results of the first exploitation of the new resource, namely the automatic extraction of Italian MWEs, are presented.
Wolves at SemEval-2018 task 10: Semantic discrimination based on knowledge and association
This paper describes the system submitted to SemEval 2018 shared task 10 "Capturing Discriminative Attributes". We use a combination of knowledge-based and co-occurrence features to capture the semantic difference between two words in relation to an attribute. We define scores based on association measures, n-gram counts, word similarity, and ConceptNet relations. The system is ranked 4th (joint) on the official leaderboard of the task.
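As an illustration of an association-based score for this task, the sketch below uses pointwise mutual information (PMI) computed from co-occurrence counts. The abstract does not name the association measures used, so PMI is an assumption here, as are the function names, the margin, and the toy counts.

```python
import math


def pmi(pair_count, word_count, attr_count, total):
    """Pointwise mutual information (base 2) of a word and an attribute.

    pair_count: number of co-occurrences of the word and the attribute
    word_count, attr_count: individual counts
    total: corpus size used to turn counts into probabilities
    """
    if pair_count == 0:
        return float("-inf")
    return math.log2(pair_count * total / (word_count * attr_count))


def is_discriminative(pmi_word1, pmi_word2, margin=1.0):
    """Hypothetical decision rule: the attribute discriminates word1 from
    word2 if it is associated with word1 more strongly by at least `margin`."""
    return pmi_word1 - pmi_word2 >= margin


# Toy counts, invented for illustration: does "yellow" separate
# "banana" from "apple"?
TOTAL = 1_000_000
YELLOW = 2_000
banana_score = pmi(pair_count=50, word_count=500, attr_count=YELLOW, total=TOTAL)
apple_score = pmi(pair_count=4, word_count=800, attr_count=YELLOW, total=TOTAL)
decision = is_discriminative(banana_score, apple_score)
```

A single association score like this would be only one feature among several; the abstract describes combining such scores with n-gram counts, word similarity, and ConceptNet relations.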
Language Resources for Italian: towards the Development of a Corpus of Annotated Italian Multiword Expressions
This paper describes the first resource annotated for multiword expressions (MWEs) in Italian. Two versions of this dataset have been prepared: the first with a fast markup list of out-of-context MWEs, and the second with an in-context annotation, where the MWEs are entered with their contexts. The paper also discusses annotation issues and reports the inter-annotator agreement for both types of annotations. Finally, the results of the first exploitation of the new resource, namely the automatic extraction of Italian MWEs, are presented.